{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/llm/nvidia_tensorrt.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Nvidia TensorRT-LLM"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.\n",
    "\n",
    "[TensorRT-LLM Github](https://github.com/NVIDIA/TensorRT-LLM)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## TensorRT-LLM Environment Setup\n",
    "Since TensorRT-LLM is a SDK for interacting with local models in process there are a few environment steps that must be followed to ensure that the TensorRT-LLM setup can be used. Please note, that Nvidia Cuda 12.2 or higher is currently required to run TensorRT-LLM.\n",
    "\n",
    "In this tutorial we will show how to use the connector with GPT2 model.\n",
    "For the best experience, we recommend following\n",
    "[Installation](https://github.com/NVIDIA/TensorRT-LLM/tree/v0.8.0?tab=readme-ov-file#installation) process on the\n",
    "official [TensorRT-LLM Github](https://github.com/NVIDIA/TensorRT-LLM).\n",
    "\n",
    "The following steps are showing how to set up your model with TensorRT-LLM v0.8.0 for x86_64 users.\n",
    "\n",
    "1. Obtain and start the basic docker image environment.\n",
    "```\n",
    "docker run --rm --runtime=nvidia --gpus all --entrypoint /bin/bash -it nvidia/cuda:12.1.0-devel-ubuntu22.04\n",
    "```\n",
    "\n",
    "2. Install dependencies, TensorRT-LLM requires Python 3.10\n",
    "```\n",
    "apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git git-lfs wget\n",
    "```\n",
    "3. Install the latest stable version (corresponding to the release branch) of TensorRT-LLM. We are using version 0.8.0, but for the most up to date release,\n",
    "please refer to [official release page](https://github.com/NVIDIA/TensorRT-LLM/releases).\n",
    "```\n",
    "pip3 install tensorrt_llm==0.8.0 -U --extra-index-url https://pypi.nvidia.com\n",
    "```\n",
    "\n",
    "4. Check installation\n",
    "```\n",
    "python3 -c \"import tensorrt_llm\"\n",
    "```\n",
    "The above command should not produce any errors.\n",
    "\n",
    "5. For this example we will use GPT2. The GPT2 model files need to be created via scripts following the instructions [here](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/gpt#usage)\n",
    "    * First, inside the container, we've started during stage 1, clone TensorRT-LLM repository:\n",
    "    ```\n",
    "    git clone --branch v0.8.0 https://github.com/NVIDIA/TensorRT-LLM.git\n",
    "    ```\n",
    "    * Install requirements for GPT2 model with:\n",
    "    ```\n",
    "    cd TensorRT-LLM/examples/gpt/ && pip install -r requirements.txt\n",
    "    ```\n",
    "    * Download hf gpt2 model\n",
    "    ```\n",
    "    rm -rf gpt2 && git clone https://huggingface.co/gpt2-medium gpt2\n",
    "    cd gpt2\n",
    "    rm pytorch_model.bin model.safetensors\n",
    "    wget -q https://huggingface.co/gpt2-medium/resolve/main/pytorch_model.bin\n",
    "    cd ..\n",
    "    ```\n",
    "    * Convert weights from HF Transformers to TensorRT-LLM format\n",
    "    ```\n",
    "    python3 hf_gpt_convert.py -i gpt2 -o ./c-model/gpt2 --tensor-parallelism 1 --storage-type float16\n",
    "    ```\n",
    "    * Build TensorRT engine\n",
    "    ```\n",
    "    python3 build.py --model_dir=./c-model/gpt2/1-gpu --use_gpt_attention_plugin --remove_input_padding\n",
    "    ```\n",
    "  \n",
    "6. Install `llama-index-llms-nvidia-tensorrt` package\n",
    "  ```\n",
    "  pip install llama-index-llms-nvidia-tensorrt\n",
    "  ```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic Usage"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Call `complete` with a prompt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "from llama_index.llms.nvidia_tensorrt import LocalTensorRTLLM\n",
    "\n",
    "llm = LocalTensorRTLLM(\n",
    "    model_path=\"./engine_outputs\",\n",
    "    engine_name=\"gpt_float16_tp1_rank0.engine\",\n",
    "    tokenizer_dir=\"gpt2\",\n",
    "    max_new_tokens=40,\n",
    ")\n",
    "\n",
    "resp = llm.complete(\"Who is Harry Potter?\")\n",
    "print(str(resp))\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The expected response should look like:\n",
    "```\n",
    "Harry Potter is a fictional character created by J.K. Rowling in her first novel, Harry Potter and the Philosopher's Stone. The character is a wizard who lives in the fictional town#\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  },
  "vscode": {
   "interpreter": {
    "hash": "a0a0263b650d907a3bfe41c0f8d6a63a071b884df3cfdc1579f00cdc1aed6b03"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
